Corpus Linguistics, Treebanks and the Reinvention of Philology
Abstract
The fields of corpus and computational linguistics address fundamental goals – and challenge us to rethink the structure – of humanistic research. All work with historical languages is, in some sense, an exercise in corpus linguistics. The Greek and Latin Treebanks illustrate changes in intellectual practice. Linguistic annotation of historical corpora serves a different community and offers a different combination of challenges and opportunities. On the one hand, historical languages such as Greek and Latin have, by definition, no native speakers. At the same time, these corpora have been, and remain, objects of intensive study. The Greek and Latin Treebanks thus have spawned three areas of activity, each of which differs from what we find in corpus linguistics and which collectively constitute a new form of intellectual activity, one that draws upon both the most traditional goals of philology and upon emerging fields such as corpus and computational linguistics.
Similar resources
Structured Knowledge for Low-Resource Languages: The Latin and Ancient Greek Dependency Treebanks
We describe here our work in creating treebanks – large collections of syntactically annotated data – for Latin and Ancient Greek. While the treebanks themselves present important datasets for traditional research in philology and linguistics, the layers of structured knowledge they contain (including disambiguated lemma, morphological, and syntactic information for every word) help offset the ...
Linguistic Annotation, the Reunification of Linguistics and Philology, and the Reinvention of the Humanities for a Global Age
This paper addresses the critical role that treebanks in particular and linguistic annotation in general must play if the Humanities are to advance the intellectual life of society as a whole. During the twentieth century we saw a rise in specialization that not only separated the practices of philology and linguistics among different researchers but wholly separate (and sometimes conflicting) ...
A Web-based Approach To Chinese Word Segmentation
Chinese text processing requires the detection of word boundaries. This is a non-trivial step because Chinese does not contain explicit whitespace between words. Existing word segmentation techniques make use of precompiled dictionaries and treebanks. The creation of dictionaries and treebanks is a labor-intensive process and consequently they are updated infrequently. Furthermore, due to their...
HamleDT: Harmonized multi-language dependency treebank
We present HamleDT – a HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. In the present article, we provide a thorough investigation and discussion of a number of phenomena that are comparable across languages, though their ann...
Huge Parsed Corpora in LASSY
One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...